Analysis of Red Wine Data

Kareem Kwong

Nov 27, 2015

Background

I explore the properties of red wine. The key goal of this study is to determine which chemical properties influence the quality of red wines.

I will use R to explore the data and look for interesting patterns and relationships.

Univariate Plots Section

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## 
## The following object is masked from 'package:scales':
## 
##     percent
## 
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## 
## The following objects are masked from 'package:base':
## 
##     as.array, trimws
library(gridExtra)
library(grid)

# Load wines csv
wines <- read.csv('wineQualityReds.csv')
summary(wines)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

From this box plot of wine quality, we can see that most wines fall between 4 and 7, and the middle half (25th to 75th percentile) received a rating of 5 or 6.

This histogram confirms the visualization from the box plot.
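
The plotting code for these two figures is not shown, so the following is only a minimal sketch of how they could be drawn with ggplot2. The helper names `quality_boxplot` and `quality_histogram` are hypothetical, and the data frame is assumed to have the `quality` column from wineQualityReds.csv.

```r
library(ggplot2)

# Hypothetical helper: a single box plot of the quality scores.
quality_boxplot <- function(df) {
  ggplot(df, aes(x = "", y = quality)) +
    geom_boxplot() +
    labs(x = NULL, y = "Quality score")
}

# Hypothetical helper: one bar per quality score, which acts as a
# histogram because quality only takes whole-number values.
quality_histogram <- function(df) {
  ggplot(df, aes(x = factor(quality))) +
    geom_bar() +
    labs(x = "Quality score", y = "Number of wines")
}

# Only runs when the data file is in the working directory.
if (file.exists("wineQualityReds.csv")) {
  wines <- read.csv("wineQualityReds.csv")
  print(quality_boxplot(wines))
  print(quality_histogram(wines))
}
```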

Fixed acidity is approximately normally distributed, with a slight right skew: mean 8.32, median 7.90, minimum 4.60, maximum 15.90.

Volatile acidity is also approximately normally distributed: mean 0.5278, median 0.5200, minimum 0.1200, maximum 1.5800.
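
A quick numeric check can complement the visual impression of normality. The sketch below computes a simple, uncorrected sample skewness (the third standardized moment); values near 0 suggest a symmetric distribution, while positive values indicate a longer right tail. The helper name `sample_skewness` is an assumption for illustration, not part of the original analysis.

```r
# Hypothetical helper: uncorrected sample skewness
# (third central moment divided by the variance to the power 1.5).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Only runs when the data file is in the working directory.
if (file.exists("wineQualityReds.csv")) {
  wines <- read.csv("wineQualityReds.csv")
  print(sapply(wines[, c("fixed.acidity", "volatile.acidity", "pH")],
               sample_skewness))
}
```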

Residual sugar tends to be very low in red wine: mean 2.539, median 2.200, minimum 0.900, maximum 15.500.

Citric acid, on the other hand, is spread more toward the tail: mean 0.271, median 0.260, minimum 0.000, maximum 1.000.

Once outliers are excluded, the amount of chlorides in red wine is approximately normally distributed.

Once the outliers at both ends of the free sulfur dioxide distribution are excluded, the majority of wines have a low amount of it.
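
The rule used to set the outliers aside is not stated. One common approach, sketched here purely as an assumption, is to keep only the values between two quantiles. The helper name `trim_quantiles` and the 1% / 99% cutoffs are illustrative choices.

```r
# Hypothetical helper: keep only values between two quantiles.
# The default 1% / 99% cutoffs are an assumption, not the original rule.
trim_quantiles <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x >= bounds[1] & x <= bounds[2]]
}

# Only runs when the data file is in the working directory.
if (file.exists("wineQualityReds.csv")) {
  wines <- read.csv("wineQualityReds.csv")
  hist(trim_quantiles(wines$chlorides),
       main = "Chlorides (1st to 99th percentile)", xlab = "chlorides")
}
```

Trimming only changes what is plotted; the summary statistics quoted in this section are computed on the full data.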

pH is approximately normally distributed: mean 3.311, median 3.310, minimum 2.740, maximum 4.010.

Sulphates are also approximately normally distributed once the long tail is set aside: mean 0.6581, median 0.6200, minimum 0.3300, maximum 2.0000.

The majority of red wines tend to have lower alcohol levels: mean 10.42, median 10.20, minimum 8.40, maximum 14.90.

Finally, the majority of red wines were given an average quality rating: mean 5.636, median 6.000, minimum 3.000, maximum 8.000.

To confirm this initial assessment, I added a new categorical variable.

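# Group quality scores into three categories:
# 'bad' (4 or lower), 'average' (5 or 6), 'good' (7 or higher)
wines$rank <- ifelse(wines$quality <= 4, 'bad', ifelse(
  wines$quality < 7, 'average', 'good'))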
wines$rank <- ordered(wines$rank,
                     levels = c('bad', 'average', 'good'))
summary(wines$rank)
##     bad average    good 
##      63    1319     217
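
The same grouping can also be written with base R's `cut()`, which builds the ordered factor in one step. This is an alternative sketch rather than the code used above; the helper name `rank_quality` is hypothetical, and it assumes quality scores are whole numbers, as they are in this dataset.

```r
# Hypothetical helper: bin whole-number quality scores into ordered ranks.
# Intervals are (-Inf, 4], (4, 6], (6, Inf), so 4 is 'bad' and 7 is 'good'.
rank_quality <- function(quality) {
  cut(quality,
      breaks = c(-Inf, 4, 6, Inf),
      labels = c("bad", "average", "good"),
      ordered_result = TRUE)
}

# Only runs when the data file is in the working directory.
if (file.exists("wineQualityReds.csv")) {
  wines <- read.csv("wineQualityReds.csv")
  print(table(rank_quality(wines$quality)))
}
```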

Univariate Analysis

What is the structure of your dataset?

The tidy data set used contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

The summary above shows these 11 variables, which describe the properties of red wine as follows:

  • fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

  • volatile acidity: the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste

  • citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

  • residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

  • chlorides: the amount of salt in the wine

  • free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  • total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  • density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

  • pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

  • sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

  • alcohol: the percent alcohol content of the wine

Output variable (based on sensory data):

  • quality: score between 0 and 10

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. The key objective of this analysis is to determine which chemical properties influence the quality of red wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I believe the acidity, pH and sulfur are most likely to contribute to the quality of the red wine.

Did you create any new variables from existing variables in the dataset?

Yes. I created rank, an ordered categorical variable that groups quality into 'bad' (4 or lower), 'average' (5 or 6) and 'good' (7 or higher).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Citric acid was the only type of acid that was not normally distributed. After some investigation, I found that a significant number of rows had a value of 0 for citric acid. Some quick research suggests this was likely intentional rather than an error: citric acid is used less frequently in wine than tartaric and malic acid because of the aggressive citric flavors it can add.
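
The number of zero values is not reported, but it is easy to check. Here is a small sketch; the helper name `zero_share` is hypothetical.

```r
# Hypothetical helper: how many values are exactly zero, and what
# fraction of the non-missing values that represents.
zero_share <- function(x) {
  c(count = sum(x == 0, na.rm = TRUE),
    share = mean(x == 0, na.rm = TRUE))
}

# Only runs when the data file is in the working directory.
if (file.exists("wineQualityReds.csv")) {
  wines <- read.csv("wineQualityReds.csv")
  print(zero_share(wines$citric.acid))
}
```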

Bivariate Plot Section

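# Keep the 11 chemical properties plus quality (drops X and rank)
wine_subset <- wines[, 2:13]
colnames(wine_subset) <- c("F.A", "V.A", "Cit.A", "Sug", "CI",
                           "F.S02", "S02", "Dens", "pH", "S04", "Alc", "Q")
# Recent GGally releases use wrap() in place of params = list(corSize = 3)
ggpairs(wine_subset, upper = list(continuous = wrap("cor", size = 3))) +
  theme(axis.text = element_blank())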

I first use a scatter plot matrix to quickly find and display some potential relationships between variables.

Using scatter plots, I can quickly see the patterns between the two variables and whether the correlation figures seem correct.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From these plots, I observed:

  • A strong positive correlation between alcohol and quality (0.476)
  • A strong positive correlation between density and fixed acidity (0.669)
  • A strong negative correlation between pH and fixed acidity (-0.683)
  • A strong negative correlation between pH and citric acid (-0.542)
  • A strong negative correlation between alcohol and density (-0.496)
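
Coefficients like these can be reproduced directly with `cor()`, which computes the Pearson correlation by default. The sketch below ranks every numeric column by the absolute size of its correlation with quality; the helper name `quality_correlations` is hypothetical.

```r
# Hypothetical helper: Pearson correlation of each numeric column with
# the target column, sorted from strongest to weakest in absolute value.
quality_correlations <- function(df, target = "quality") {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  predictors <- setdiff(numeric_cols, target)
  r <- sapply(predictors, function(col) cor(df[[col]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Only runs when the data file is in the working directory.
if (file.exists("wineQualityReds.csv")) {
  wines <- read.csv("wineQualityReds.csv")
  # Drop the row index X so it is not treated as a predictor.
  print(round(quality_correlations(wines[, setdiff(names(wines), "X")]), 3))
  # A single pair can be checked the same way.
  print(cor(wines$pH, wines$fixed.acidity))
}
```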

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Not knowing much about wine, I found many interesting relationships. The first is that citric acid, pH and alcohol all have strong relationships with the quality of the wine.

In addition, a wine that has higher density generally has a higher fixed acidity. A wine with a high alcohol level also tends to have higher chlorides. Finally, a wine that has higher pH will have a lower level of citric acid and fixed acidity.

What was the strongest relationship you found?

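Of the correlations listed above, the strongest were between pH and fixed acidity (-0.683) and between density and fixed acidity (0.669). Among the relationships involving quality, the strongest was with alcohol (0.476).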

Multivariate Plot Section

I look at the same plots as in the bivariate section, this time using rank instead of quality.

Taking a closer look at the relationship between alcohol and pH, using rank we are able to better see the distinction between the three categories of wine. Despite my hypothesis that more basic wines would be rated less highly, this does not appear to be the case.
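
A scatter plot of this kind could be sketched as follows. The plotting code is not shown in this report, so the helper name `ph_alcohol_scatter`, the transparency setting and the axis labels are assumptions; the data frame is assumed to already contain the `rank` column created earlier.

```r
library(ggplot2)

# Hypothetical helper: pH against alcohol, with points coloured by rank.
ph_alcohol_scatter <- function(df) {
  ggplot(df, aes(x = alcohol, y = pH, colour = rank)) +
    geom_point(alpha = 0.5) +
    labs(x = "Alcohol (%)", y = "pH", colour = "Rank")
}

# Only runs when the data file is in the working directory.
if (file.exists("wineQualityReds.csv")) {
  wines <- read.csv("wineQualityReds.csv")
  wines$rank <- ordered(
    ifelse(wines$quality <= 4, "bad",
           ifelse(wines$quality < 7, "average", "good")),
    levels = c("bad", "average", "good"))
  print(ph_alcohol_scatter(wines))
}
```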

However, there is definitely a strong correlation between alcohol content and quality. I chose a box plot to display the quartile information and show the distribution of the wines by quality effectively.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  • Higher quality red wines tend to have higher density and higher citric acid
  • Higher quality red wines tend to have lower pH and higher fixed acidity, which is consistent with the observation above.
  • Higher quality red wines tend to have higher alcohol content

Final Plots and Summary

Histogram of Wine Quality

This plot was created to show that the majority of wines were given an average rating, so the quality scores are roughly normally distributed.

Relationship between pH, alcohol and quality

This plot was created to show the relationship between pH, alcohol and quality and how quality red wines tend to have a higher alcohol level coupled with lower pH.

How strong is the relationship with alcohol?

These box plots emphasize the strong positive correlation between alcohol and wine quality.
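
To put numbers behind the box plots, the median alcohol level can be tabulated per quality score. This is a minimal sketch in base R, with base graphics swapped in for ggplot2 to keep it short; the helper name `median_alcohol_by_quality` is hypothetical.

```r
# Hypothetical helper: median alcohol level for each quality score.
median_alcohol_by_quality <- function(df) {
  tapply(df$alcohol, df$quality, median)
}

# Only runs when the data file is in the working directory.
if (file.exists("wineQualityReds.csv")) {
  wines <- read.csv("wineQualityReds.csv")
  print(median_alcohol_by_quality(wines))
  # Base-graphics version of the box plots, one box per quality score.
  boxplot(alcohol ~ quality, data = wines,
          xlab = "Quality score", ylab = "Alcohol (%)")
}
```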

Reflection

Through this assignment I learned more about wines and the factors that contribute to their quality. However, because the majority of wines were given an average rating, this dataset is not necessarily the be-all and end-all of analysis in this area. The clustering of ratings is likely due to the fact that wine quality is subjective and varies from one taster to another.

Through this analysis I think the largest struggle for me was deciding the appropriate plots to use in the analysis. Ultimately I decided that distribution analysis seemed to be the best way to approach the ranking of quality, so I relied mostly on histograms, scatter plots and box plots.

I think the greatest success was having, by the time I arrived at the final section, a set of graphs that I could use to build a reasonable story and conclusion. Coming from a business background, it was exciting to approach a problem from an analytical perspective rather than a merely empirical or theoretical one.

In the future, I would like to do a more detailed breakdown of the different factors that affect the quality of the red wine. Perhaps data on the demographics and psychographics of the wine experts would be interesting to look at as well. I feel that my results merely confirmed suspicions I had before combing through the dataset.